Client Report - Can You Predict That?

Course DS 250

Author

Ryan Lee

Show the code
import pandas as pd 
import numpy as np
from lets_plot import *

# add the additional libraries you need to import for ML here
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report


LetsPlot.setup_html(isolated_frame=True)
Show the code
# Learn more about Code Cells: https://quarto.org/docs/reference/cells/cells-jupyter.html

# Include and execute your code here

# import your data here using pandas and the URL

df = pd.read_csv("https://raw.githubusercontent.com/byuidatascience/data4dwellings/master/data-raw/dwellings_ml/dwellings_ml.csv")
df2 = pd.read_csv("https://raw.githubusercontent.com/byuidatascience/data4dwellings/master/data-raw/dwellings_neighborhoods_ml/dwellings_neighborhoods_ml.csv")
dfc = pd.read_csv("https://raw.githubusercontent.com/byuidatascience/data4dwellings/master/data-raw/dwellings_denver/dwellings_denver.csv")

Elevator pitch

A SHORT (2-3 SENTENCES) PARAGRAPH THAT DESCRIBES KEY INSIGHTS TAKEN FROM METRICS IN THE PROJECT RESULTS THINK TOP OR MOST IMPORTANT RESULTS. (Note: this is not a summary of the project, but a summary of the results.)

A Client has requested this analysis and this is your one shot of what you would say to your boss in a 2 min elevator ride before he takes your report and hands it to the client.

QUESTION|TASK 1

Create 2-3 charts that evaluate potential relationships between the home variables and before1980. Explain what you learn from the charts that could help a machine learning algorithm.

The first chart uses the first csv file to compare the count of homes by number of bedrooms for homes built before and after 1980; two- and three-bedroom homes are the most common in both groups. For the second chart, I merged the second csv file onto the first on parcel and compared living area across the two eras, which suggests that homes built during or after 1980 tend to have larger living areas.

Show the code
df.head()
df2.head()
dfc.head()

graph1 = (ggplot(df, aes(x='numbdrm', fill=as_discrete('before1980'))) + geom_bar(position='dodge'))


dfOneTwo = df.merge(df2, on="parcel", how="left")
dfOneTwo["before1980"] = dfOneTwo["before1980"].astype(bool)
dfOneTwo["era"] = dfOneTwo["before1980"].map({True: "Before 1980", False: "After 1980"})
chart = dfOneTwo[["livearea", "nocars", "numbdrm", "era"]]

graph2 = (ggplot(chart, aes(x=as_discrete("era"), y="livearea", fill="era")) + geom_boxplot())

graph1.show()
graph2.show()

QUESTION|TASK 2

Build a classification model labeling houses as being built “before 1980” or “during or after 1980”. Your goal is to reach or exceed 90% accuracy. Explain your final model choice (algorithm, tuning parameters, etc) and describe what other models you tried.

For trainY I used the before1980 label, and for trainX I used numeric home features: living area, bedrooms, bathrooms, car spaces, stories, finished basement square footage, garage square footage, and lot size. A random forest (200 trees, max depth 12) reached about 87% accuracy on the held-out test set, which falls short of the 90% target but is a reasonable result given the limited feature set.

Show the code
dfOneTwo = df.merge(df2, on="parcel", how="left")
dfOneTwo["before1980"] = dfOneTwo["before1980"].astype(int)

selected_tables = ["livearea", "numbdrm", "numbaths", "nocars", "stories", "finbsmnt", "garagesqft", "lotsize"]
features = [s for s in selected_tables if s in dfOneTwo.columns]

df_merge = dfOneTwo[features + ["before1980"]].dropna(subset=["before1980"])
x = df_merge[features]
y = df_merge["before1980"]

missData = SimpleImputer(strategy="median")
x_missData = pd.DataFrame(missData.fit_transform(x), columns=x.columns, index=x.index)

x_train, x_test, y_train, y_test = train_test_split( x_missData, y, test_size=0.2, random_state=42, stratify=y )

randomEsimate = RandomForestClassifier(n_estimators=200, max_depth=12, random_state=42)
randomEsimate.fit(x_train, y_train)

y_pred = randomEsimate.predict(x_test)
acc = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
report = classification_report(y_test, y_pred, digits=4)

print(f"Random Forest accuracy: {acc:.4f}")
print("Confusion matrix:\n", cm)
print("\nClassification report:\n", report)
Random Forest accuracy: 0.8739
Confusion matrix:
 [[1769  362]
 [ 343 3119]]

Classification report:
               precision    recall  f1-score   support

           0     0.8376    0.8301    0.8338      2131
           1     0.8960    0.9009    0.8985      3462

    accuracy                         0.8739      5593
   macro avg     0.8668    0.8655    0.8662      5593
weighted avg     0.8738    0.8739    0.8738      5593

QUESTION|TASK 3

Justify your classification model by discussing the most important features selected by your model. This discussion should include a feature importance chart and a description of the features.

This chart ranks the selected features by their importance scores in the random forest; the features at the top of the chart contributed the most to separating homes built before 1980 from those built during or after.

Show the code
importance_df = (
    pd.DataFrame({"Feature": x_missData.columns, "Importance": randomEsimate.feature_importances_})
    .sort_values("Importance", ascending=False)
    .head(15)
    .reset_index(drop=True)
)

# Create Label column for plotting (just copy Feature names)
importance_df["Label"] = importance_df["Feature"]

# Make Label an ordered categorical so the plot keeps the importance ordering
importance_df["Label"] = pd.Categorical(
    importance_df["Label"],
    categories=importance_df.sort_values("Importance")["Label"],
    ordered=True
)

# Plot using lets-plot
plot = (
    ggplot(importance_df, aes(x="Label", y="Importance"))
    + geom_bar(stat="identity", fill="#4C72B0")
    + coord_flip()
    + xlab("Feature")
    + ylab("Importance Score")
)

# Show the plot
plot.show()

QUESTION|TASK 4

Describe the quality of your classification model using 2-3 different evaluation metrics. You also need to explain how to interpret each of the evaluation metrics you use.

type your results and analysis here

Show the code
# Include and execute your code here
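Once real results are in, the usual metrics to report here are accuracy, precision, and recall. A minimal, self-contained sketch on toy labels (not the project model) showing how each is computed and read:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix

# Toy labels: 1 = built before 1980, 0 = built during/after 1980
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 1, 0, 0, 1, 0, 1, 0]

# Accuracy: share of all homes classified correctly
acc = accuracy_score(y_true, y_pred)

# Precision: of the homes predicted "before 1980", how many truly are
prec = precision_score(y_true, y_pred)

# Recall: of the homes truly "before 1980", how many the model found
rec = recall_score(y_true, y_pred)

print(f"accuracy={acc:.2f} precision={prec:.2f} recall={rec:.2f}")
# prints accuracy=0.80 precision=0.80 recall=0.80
print(confusion_matrix(y_true, y_pred))
```

The same three calls applied to the Task 2 test set (y_test, y_pred) would give the project's real numbers.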

STRETCH QUESTION|TASK 1

Repeat the classification model using 3 different algorithms. Display their Feature Importance, and Decision Matrix. Explain the differences between the models and which one you would recommend to the Client.

type your results and analysis here

Show the code
# Include and execute your code here
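One way to structure the three-model comparison is a simple loop, sketched here on synthetic data from make_classification; the dwellings features would replace the synthetic X and y, and the three algorithms chosen here (random forest, gradient boosting, logistic regression) are one reasonable trio, not the assignment's required set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

# Synthetic stand-in for the dwellings features and before1980 label
X, y = make_classification(n_samples=1000, n_features=8, n_informative=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

models = {
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    print(f"\n{name}: accuracy={accuracy_score(y_test, pred):.3f}")
    print(confusion_matrix(y_test, pred))
    # Tree ensembles expose feature_importances_; linear models expose coef_
    if hasattr(model, "feature_importances_"):
        print("importances:", np.round(model.feature_importances_, 3))
    else:
        print("coefficients:", np.round(model.coef_[0], 3))
```

Running all three on the same train/test split keeps the accuracy and confusion matrix numbers directly comparable when recommending a model to the client.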

STRETCH QUESTION|TASK 2

Join the dwellings_neighborhoods_ml.csv data to the dwellings_ml.csv on the parcel column to create a new dataset. Duplicate the code for the stretch question above and update it to use this data. Explain the differences and if this changes the model you recommend to the Client.

type your results and analysis here

Show the code
# Include and execute your code here
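The join itself is a one-line pandas merge on parcel. A toy sketch with made-up parcel IDs and a stand-in neighborhood column (the real files have many more columns, but the shape of the operation is the same):

```python
import pandas as pd

# Toy stand-ins for dwellings_ml and dwellings_neighborhoods_ml
homes = pd.DataFrame({"parcel": ["A1", "A2", "A3"], "livearea": [1200, 1800, 950]})
hoods = pd.DataFrame({"parcel": ["A1", "A2", "A3"], "neighborhood_12": [1, 0, 0]})

# Left join keeps every home row and appends the neighborhood columns
joined = homes.merge(hoods, on="parcel", how="left")
print(joined.shape)  # one row per home, columns from both tables
```

After the join, the neighborhood columns can be appended to the feature list and the three models from the previous stretch question refit unchanged.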

STRETCH QUESTION|TASK 3

Can you build a model that predicts the year a house was built? Explain the model and the evaluation metrics you would use to determine if the model is good.

type your results and analysis here

Show the code
# Include and execute your code here
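Predicting the year built is a regression problem rather than classification, so a regressor plus error metrics expressed in years would fit. A minimal sketch on synthetic data (the features and target here are invented for illustration; the real model would use the dwellings columns and yrbuilt):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error

rng = np.random.default_rng(42)
n = 1000
livearea = rng.uniform(800, 4000, n)
numbaths = rng.integers(1, 5, n)
# Synthetic target: newer homes tend to be larger, plus noise
yrbuilt = 1900 + 0.03 * livearea + 5 * numbaths + rng.normal(0, 10, n)

X = np.column_stack([livearea, numbaths])
X_train, X_test, y_train, y_test = train_test_split(X, yrbuilt, test_size=0.2, random_state=42)

reg = RandomForestRegressor(n_estimators=200, random_state=42)
reg.fit(X_train, y_train)
pred = reg.predict(X_test)

# MAE: average number of years the prediction misses by
# RMSE: like MAE, but penalizes large misses more heavily
mae = mean_absolute_error(y_test, pred)
rmse = mean_squared_error(y_test, pred) ** 0.5
print(f"MAE={mae:.1f} years, RMSE={rmse:.1f} years")
```

Because both metrics are in years, they are easy to judge: an MAE of, say, 10 means the model is typically about a decade off, which tells the client directly whether the predictions are precise enough to be useful.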